An Efficient Algorithm for Learning Distances that Obey the Triangle Inequality
Authors
Abstract
Semi-supervised clustering of images has been an interesting problem for machine learning and computer vision researchers for decades. Pairwise constrained clustering is a popular paradigm for semi-supervision that uses knowledge about whether two images belong to the same category (must-link constraint) or not (can’t-link constraint). Performance of constrained clustering algorithms can be improved if the supervision on some image pairs is used to modify the pairwise distances of other image pairs, for which no supervision is available. There are several excellent metric learning approaches for the case in which image distances can be represented as Euclidean distances in vector spaces [6]. However, in many cases distances are computed robustly [3], on manifolds [4], or as the output of algorithms [1] or classifiers. Such distances do not lead to a natural embedding in a metric space and may not even obey the triangle inequality. We are particularly interested in semi-supervised clustering of fine-grained categories. If we look at the top 10 distance measures in two important domains related to fine-grained classification (LFW faces and leaf shapes), we find that 80% of the methods use non-vector-space distances. These distances can be used for clustering images, but they are not suitable for existing metric learning algorithms. To propagate constraints from supervised to unsupervised pairs, some structure must be assumed on the set of possible distances; otherwise, the distance between supervised pairs could be altered without affecting the distance between unsupervised pairs. Perhaps the weakest assumption we can make about a distance is that it obeys the triangle inequality. Enforcing the triangle inequality allows us to propagate constraints: if a constraint alters one distance, other distances must also change to maintain the triangle inequalities. For many interesting distances, the triangle inequality is not guaranteed to hold.
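Whether a given distance matrix obeys the triangle inequality can be checked directly by enumerating triples. The following minimal sketch (not from the paper; the function name and tolerance are illustrative) measures the fraction of violated triangle inequalities in a symmetric distance matrix:

```python
import numpy as np

def triangle_violation_rate(D, tol=1e-9):
    """Fraction of triples (i, j, k) with D[i, j] > D[i, k] + D[k, j].

    D is a symmetric N x N matrix of pairwise distances with a zero
    diagonal. A small tolerance guards against floating-point noise.
    """
    N = D.shape[0]
    violations, total = 0, 0
    for i in range(N):
        for j in range(i + 1, N):          # each unordered pair once
            for k in range(N):             # every possible intermediate point
                if k == i or k == j:
                    continue
                total += 1
                if D[i, j] > D[i, k] + D[k, j] + tol:
                    violations += 1
    return violations / total

# A non-metric example: D[0, 2] = 5 exceeds D[0, 1] + D[1, 2] = 2.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
print(triangle_violation_rate(D))
```

For this 3-point example only the triple through point 1 is violated, so the rate is 1/3. The brute-force enumeration is O(N^3), which is exactly the cost the paper's sampling approach is designed to avoid at optimization time.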
However, we empirically find that the triangle inequality almost always holds for distances computed for fine-grained classification, even when it is not explicitly enforced. This strongly motivates us to enforce the triangle inequalities when we alter distances to incorporate the pairwise constraints. We then find empirically that enforcing the triangle inequality improves performance on several real-world datasets. Our main contribution is to formulate distance learning with pairwise constraints as a metric nearness problem [2] and to provide an efficient algorithm that solves metric nearness for clustering. First, we formulate a quadratic optimization problem in which the pairwise distances between images are modified so that the pairwise constraints and the triangle inequality constraints are satisfied as much as possible. Since enforcing O(N^3) triangle inequalities is computationally expensive, we propose a graph-based approach in which only O(n(M + C)) triangle inequalities are sampled for use in the QP (N is the total number of images, n is the number of nearest neighbors in the n-nearest-neighbor graph, and M and C are the numbers of must-link and can’t-link constraints, respectively). We empirically show that this sampling approach works well in practice. We use the distances obtained by our approach together with a constrained clustering algorithm [5] to achieve state-of-the-art clustering results. To gain insight into our fast approach, we theoretically analyze a simplified case in which only one pairwise constraint is present. Our sampling approach is based on the intuition that clustering is predominantly affected by small distances and is not sensitive to the exact value of larger distances. We prove that our sampling approach produces the same set of small distances that would be obtained by enforcing all constraints.
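The O(n(M + C)) sampling idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact sampling rule: for each constrained pair (i, j) it collects triples (i, j, k) where k is among the n nearest neighbors of i or j, so the number of sampled triangle inequalities grows with n and the number of constraints rather than with N^3:

```python
import numpy as np

def sample_triangles(D, constrained_pairs, n_neighbors):
    """Sample triangle inequalities that involve each constrained pair.

    For every constrained pair (i, j), form triples (i, j, k) with k drawn
    from the n nearest neighbors of i or j, giving O(n * (M + C)) triples.
    The neighbor rule here is an illustrative assumption.
    """
    # Nearest neighbors per row; column 0 is the point itself (zero diagonal).
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    triples = set()
    for (i, j) in constrained_pairs:
        for k in set(nn[i]) | set(nn[j]):
            k = int(k)
            if k != i and k != j:
                # Store as a sorted tuple so each triangle appears once.
                triples.add(tuple(sorted((i, j, k))))
    return triples

# Small demo on 5 random points in the plane.
rng = np.random.default_rng(0)
pts = rng.random((5, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
triples = sample_triangles(D, [(0, 4)], n_neighbors=2)
print(triples)
```

Each sampled triple would then contribute its three triangle inequalities as linear constraints in the QP over the modified distances; with one constrained pair and n = 2 the sketch yields at most four triples, regardless of N.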
We perform experiments on leaf and face image/video datasets and show that distances obtained by our method achieve state-of-the-art clustering results.

Formulation: We begin with a set of N unlabeled images U (x ∈ U) from K classes. We are also provided with initial distances between all
Similar papers
Biswas, Jacobs: An Efficient Algorithm for Learning Distances
Semi-supervised clustering improves performance using constraints that indicate if two images belong to the same category or not. Success depends on how effectively these constraints can be propagated to the unsupervised data. Many algorithms use these constraints to learn Euclidean distances in a vector space. However, distances between images are often computed using classifiers or combinator...
Triangle Fixing Algorithms for the Metric Nearness Problem
Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This pap...
Accelerating Lloyd’s Algorithm for k-Means Clustering
The k-means clustering algorithm, a staple of data mining and unsupervised learning, is popular because it is simple to implement, fast, easily parallelized, and offers intuitive results. Lloyd’s algorithm is the standard batch, hill-climbing approach for minimizing the k-means optimization criterion. It spends a vast majority of its time computing distances between each of the k cluster center...
The p-Neighbor k-Center Problem
The k-center problem with triangle inequality is that of placing k center nodes in a weighted undirected graph in which the edge weights obey the triangle inequality, so that the maximum distance of any node to its nearest center is minimized. In this paper, we consider a generalization of this problem where, given a number p, we wish to place k centers so as to minimize the maximum distance of any ...
Generalized k-Center Problems
The k-center problem with triangle inequality is that of placing k center nodes in a weighted undirected graph in which the edge weights obey the triangle inequality, so that the maximum distance of any node to its nearest center is minimized. In this paper, we consider a generalization of this problem where, given a number p, we wish to place k centers so as to minimize the maximum distance of...